
12th iThome Ironman Challenge

DAY 8

How to Implement Q-learning

Today we'll look at how to implement Q-learning!
The code is based on the notebook Q* Learning with FrozenLakev2.ipynb.

Setup

We'll use Colab as our implementation platform and OpenAI Gym for the environment.
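If gym is not already available in your Colab runtime, it can be installed with pip first (a minimal sketch; the classic gym package is assumed here, since this post uses FrozenLake-v0):

!pip install gym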

FrozenLake

Winter is here. You and your friends were tossing around a frisbee at the park when you made a wild throw that left the frisbee out in the middle of the lake. The water is mostly frozen, but there are a few holes where the ice has melted. If you step into one of those holes, you'll fall into the freezing water. At this time, there's an international frisbee shortage, so it's absolutely imperative that you navigate across the lake and retrieve the disc. However, the ice is slippery, so you won't always move in the direction you intend.
Excerpted from https://gym.openai.com/envs/FrozenLake-v0/

In short, the player wins by reaching the goal.

Map

A 4×4 grid.

S is the starting point, F is frozen surface you can walk on, H is a hole (stepping into one ends the episode), and G is the goal (reaching it wins).

Actions

Up, down, left, right.

States

The grid has 16 cells, so there are 16 states.

Q table

Each state allows 4 possible actions, so the Q table has
$4 \times 4 \times 4 = 64$ entries, i.e. a 16×4 table (a 4×4 grid of states, 4 actions each).

Initialization

First, build a Q table filled with zeros.
env.observation_space.n (the number of states) and env.action_space.n (the number of actions) give the two dimensions of the Q table.

import numpy as np
import gym
import random
env = gym.make("FrozenLake-v0")
action_size = env.action_space.n
state_size = env.observation_space.n
# Create our Q table with state_size rows and action_size columns (16x4)
qtable = np.zeros((state_size, action_size))

Parameter settings

total_episodes = 20000       # Total episodes
learning_rate = 0.8          # Learning rate
max_steps = 50               # Max steps per episode
gamma = 0.95                 # Discounting rate

# Exploration parameters
epsilon = 1.0                 # Exploration rate
max_epsilon = 1.0             # Exploration probability at start
min_epsilon = 0.01            # Minimum exploration probability 
decay_rate = 0.005            # Exponential decay rate for exploration prob

Algorithm

# List of rewards
rewards = []

# Train for total_episodes episodes (or until learning is stopped)
for episode in range(total_episodes):
    # Reset the environment
    state = env.reset()
    step = 0
    done = False
    total_rewards = 0
    
    for step in range(max_steps):
        # Choose an action a in the current state (s)
        ## First we randomize a number
        exp_exp_tradeoff = random.uniform(0, 1)

        ## If this number is greater than epsilon --> exploitation (take the action with the biggest Q value for this state)
        if exp_exp_tradeoff > epsilon:
            action = np.argmax(qtable[state,:])
            #print(exp_exp_tradeoff, "action", action)

        # Else doing a random choice --> exploration
        else:
            action = env.action_space.sample()
            #print("action random", action)
            
        
        # Take the action (a) and observe the outcome state(s') and reward (r)
        new_state, reward, done, info = env.step(action)

        # Update Q(s,a):= Q(s,a) + lr [R(s,a) + gamma * max Q(s',a') - Q(s,a)]
        # qtable[new_state,:] : all the actions we can take from new state
        qtable[state, action] = qtable[state, action] + learning_rate * (reward + gamma * np.max(qtable[new_state, :]) - qtable[state, action])
        total_rewards += reward
        
        # Our new state is state
        state = new_state
        
        # If done (dead or reached the goal): finish the episode
        if done:
            break
        
    # Reduce epsilon (because we need less and less exploration)
    epsilon = min_epsilon + (max_epsilon - min_epsilon)*np.exp(-decay_rate*episode) 
    rewards.append(total_rewards)
    

print ("Score over time: " +  str(sum(rewards)/total_episodes))
print(qtable)

In the action-selection step, if exp_exp_tradeoff is greater than epsilon, the action is chosen from the Q table (exploitation); otherwise a random action is taken (exploration).

The Q-table update line is the heart of Q-learning! As a quick review:
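In standard notation (with $\alpha$ = learning_rate and $\gamma$ = gamma), the rule implemented in the code above is
$Q(s,a) \leftarrow Q(s,a) + \alpha \left[ R(s,a) + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]$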

Finally, epsilon is gradually decreased at the end of each episode, so the agent explores less and less as training progresses.
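To get a feel for what decay_rate = 0.005 means in practice, here is a small sketch (not from the original post) that evaluates the same decay formula at a few episode counts:

import numpy as np

max_epsilon, min_epsilon, decay_rate = 1.0, 0.01, 0.005
for episode in [0, 500, 1000, 5000, 20000]:
    eps = min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * episode)
    # epsilon starts at 1.0 (pure exploration) and decays toward 0.01
    print(episode, round(eps, 3))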

# Call this line to display the current state of the environment
env.render()

Results


At best, the agent reaches the goal in 13 steps!
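For reference, here is a minimal sketch (not part of the original post) that replays the trained Q table greedily and renders each step, reusing the env, qtable and max_steps defined above:

# Replay a few episodes with the learned Q table (greedy policy, no exploration)
for episode in range(3):
    state = env.reset()
    done = False
    for step in range(max_steps):
        env.render()
        action = np.argmax(qtable[state, :])   # always take the best-known action
        state, reward, done, info = env.step(action)
        if done:
            print("Episode", episode, "finished after", step + 1, "steps, reward =", reward)
            break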

Conclusion

Today we walked through an implementation of Q-learning and saw how it works in practice.

References

https://gym.openai.com/envs/FrozenLake-v0/
Q* Learning with FrozenLakev2.ipynb

